
ci: run tests on GPU server #747

Merged
MauroToscano merged 26 commits into main from ci_run_tests_gpu on Jul 3, 2026

@JuArce JuArce commented Jun 30, 2026

Contributor

What

Adds a GPU test suite that runs in the merge queue and blocks the merge if it fails. It rents an RTX 5090 on Vast.ai, runs the CUDA tests there, and always destroys the box — reusing the rent → provision → run → always-destroy orchestration from benchmark-gpu.yml.

Why

CPU CI (pr_main.yaml) runs on GitHub ubuntu-latest runners, which have no GPU, so the CUDA backend is never exercised before merge. This suite runs the GPU-relevant tests on real hardware.

Test groups (run by scripts/gpu_test.sh via Makefile targets)

| # | Target | Covers |
|---|--------|--------|
| 1 | test-math-cuda | GPU↔CPU kernel parity (NTT, LDE, barycentric, FRI, merkle, ext3, keccak, …) |
| 2 | test-cuda-integration | End-to-end GPU prove: every dispatch fires + the proof verifies |
| 3 | test-cuda-fallback | GPU dispatch error → CPU fallback still produces a verifying proof |
| 4 | test-prover-cuda | The lambda-vm-prover/stark/crypto/ecsm suite on the GPU path (CPU CI's sharded prover tests, run single-threaded with --features cuda) |
| 5 | test-prover-comprehensive-cuda | The comprehensive all-instructions prove (test_prove_elfs_all_instructions_64_full) on the GPU path |

Groups 1–3 are GPU-only (no CPU equivalent); 4–5 mirror CPU CI's prover jobs but with the CUDA path enabled. All groups run even if one fails; the job fails if any group does.
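
The "run everything, fail at the end" behaviour can be sketched as below. This is a minimal illustration, not the contents of scripts/gpu_test.sh: `run_group`, `run_all`, and the `true`/`false` stand-in commands are hypothetical names chosen so the sketch runs without a GPU.

```bash
#!/usr/bin/env bash
# Sketch: run every group even if an earlier one fails, then report at the end.
# Deliberately no `set -e`: a failing group must not abort the loop.

failed=()

run_group() {
  local name="$1"
  shift
  echo "=== ${name} ==="
  if "$@"; then
    echo "PASS ${name}"
  else
    echo "FAIL ${name}"
    failed+=("${name}")   # remember the failure, keep going
  fi
}

run_all() {
  failed=()
  # In a real suite each command would be something like `make <target>`;
  # `true` and `false` are stand-ins (assumption) so this runs anywhere.
  run_group test-math-cuda true
  run_group test-cuda-integration false
  run_group test-cuda-fallback true
  if ((${#failed[@]} > 0)); then
    echo "failed groups: ${failed[*]}"
    return 1
  fi
  return 0
}

run_all || echo "suite failed"
```

The aggregate exit code is only computed after the last group, so a failure in group 2 still lets group 3 produce its log.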

How it works

  • Trigger: merge_group (one GPU rental per merge, not per push) + workflow_dispatch. A failure ejects the PR from the queue.
  • The box checks out the merge ref and runs scripts/gpu_test.sh, which:
    • pins cudarc (CUDARC_PIN=cuda-12080) and verifies the GPU toolchain (prints full nvidia-smi),
    • builds the guest ELFs (compile-programs-asm + compile-programs-rust),
    • runs the 5 groups, aggregating failures.
  • Machine info is also printed in its own dedicated step (nvidia-smi, nvcc --version, CPU/RAM) before the test run, so the hardware is easy to find without scrolling the large test log.
  • Teardown always destroys the instance (--yes, 3× retry, plus a label-based sweep so a cancelled-mid-rent box can't leak).
  • Reporting: full stdout+stderr in the step log; the job summary lists the box specs, the failed tests grouped by suite, and the panic/assertion message for each.
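
The retried teardown can be pictured with a small generic retry helper. This is a sketch under assumptions: `retry` and `fake_destroy` are hypothetical names, and `fake_destroy` only simulates a destroy command that fails twice before succeeding; the real workflow's destroy call and label sweep are not reproduced here.

```bash
#!/usr/bin/env bash
# Sketch: retry a command up to N times with a fixed delay between attempts.

retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    echo "attempt ${i}/${attempts} failed: $*" >&2
    if ((i < attempts)); then
      sleep "${delay}"
    fi
  done
  return 1
}

# Hypothetical stand-in for the real destroy command: fails until the 3rd call.
calls=0
fake_destroy() {
  calls=$((calls + 1))
  ((calls >= 3))
}

# 3 attempts, 0 seconds apart (a real teardown would likely wait longer).
retry 3 0 fake_destroy "instance-123" && echo "destroyed after ${calls} calls"
```

A retry loop alone does not cover a job cancelled before the instance id was recorded, which is the case the description's label-based sweep is there for.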

Provisioning: price bands + retry across offers

Provisioning is a single step, "Provision instance (retry across offers)" (ideas borrowed from IDP's create-vast-server.yml), designed so a slow or flaky host can't stall the merge gate:

  • Escalating price bands: PRICE_THRESHOLDS is "0.6 0.8 1.0" ($/hr, ascending). It picks the priciest (most reliable) offer in the cheapest band that has capacity, only climbing to a pricier band when a lower one is empty. The last value ($1) is the hard cap.
  • Retry across distinct offers — up to MAX_TRIES: 5 different physical hosts. Each host gets a PROVISION_TIMEOUT: 600s readiness budget (running + sshd accepting our key + onstart bootstrap finished); if it isn't ready in time, the box is destroyed and a different offer is tried.
  • Tried-host exclusion — every attempted host's machine_id is tracked and excluded from later scans, so a flaky host is never re-picked.
  • Fast-fail on unrecoverable states — an offer whose status_msg reports an image that can't be pulled / a host that can't meet the ask ("not started loading", "cannot be met", daemon errors) is abandoned immediately instead of burning the full 600s budget.
  • Transient scarcity — within each attempt, all bands are re-scanned up to OFFER_ATTEMPTS: 10 times (30s apart) before giving up, to ride out the small, fast-churning RTX 5090 pool.
  • Offer filter: RTX 5090, ≥16 cores, ≥48 GB RAM, ≥64 GB disk, cuda_max_good>=13.1, driver major ≥ 580.
  • Each swapped-out box is destroyed in-step (its recorded id cleared), and the final ready box is recorded for the always-run teardown — so no box leaks across the retry loop.
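
The band climb and tried-host exclusion described above can be sketched as a pure selection function. This is an illustration under assumptions, not the workflow's code: `pick_offer` is a hypothetical name, a "band" is taken to mean "price at or below the threshold", and offers are given as plain `offer_id machine_id price` lines rather than real Vast.ai output.

```bash
#!/usr/bin/env bash
# Sketch: pick the priciest offer in the cheapest band that has capacity,
# skipping machine ids that were already tried.

pick_offer() {
  local thresholds="$1" excluded="$2" offers="$3"
  local cap best
  for cap in ${thresholds}; do
    best="$(awk -v cap="${cap}" -v excluded=" ${excluded} " '
      index(excluded, " " $2 " ") { next }        # host already tried
      $3 + 0 <= cap + 0 && $3 + 0 > best_price {  # priciest offer within the band
        best_price = $3 + 0
        best_line = $0
      }
      END { if (best_line != "") print best_line }
    ' <<<"${offers}")"
    if [[ -n "${best}" ]]; then
      echo "${best}"
      return 0
    fi
  done
  return 1   # every band was empty: nothing under the hard cap
}

# Made-up sample offers: offer_id machine_id price_per_hour
offers='101 m1 0.55
102 m2 0.59
103 m3 0.75
104 m4 0.95'

pick_offer "0.6 0.8 1.0" "" "${offers}"        # prints: 102 m2 0.59
pick_offer "0.6 0.8 1.0" "m1 m2" "${offers}"   # prints: 103 m3 0.75
```

In the second call both hosts in the cheapest band are excluded, so the selection climbs to the next band, which mirrors how a retry after a failed host would move on.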

CUDA version requirement

An NVIDIA driver supporting CUDA ≥ 13.1 is required: the kernels are compiled with the toolkit's nvcc into PTX that the driver JIT-compiles at load — an older driver rejects it with CUDA_ERROR_UNSUPPORTED_PTX_VERSION. Enforced by the cuda_max_good>=13.1 offer filter and documented in the README "GPU Tests" section + crypto/math-cuda/build.rs.
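
A pre-flight check for this requirement could look like the sketch below. It is an assumption-laden illustration: `driver_cuda_version` and `version_ge` are hypothetical helpers, and `sample_header` is a made-up line in the style of the nvidia-smi banner, used so the sketch runs on a machine without a GPU.

```bash
#!/usr/bin/env bash
# Sketch: check that the driver-reported CUDA version is at least 13.1.

# Extract "X.Y" from an nvidia-smi style header line.
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' <<<"$1"
}

# Succeeds when version $1 >= version $2, comparing major then minor as integers.
version_ge() {
  local have_major have_minor want_major want_minor rest
  IFS=. read -r have_major have_minor rest <<<"$1"
  IFS=. read -r want_major want_minor rest <<<"$2"
  have_minor="${have_minor:-0}"
  want_minor="${want_minor:-0}"
  # 10# forces base 10 so a component like "08" is not parsed as octal.
  if ((10#${have_major} != 10#${want_major})); then
    ((10#${have_major} > 10#${want_major}))
  else
    ((10#${have_minor} >= 10#${want_minor}))
  fi
}

# Illustrative header text (assumption), not captured from a real box.
sample_header='| NVIDIA-SMI 580.00    Driver Version: 580.00    CUDA Version: 13.1 |'

have="$(driver_cuda_version "${sample_header}")"
if version_ge "${have}" "13.1"; then
  echo "driver CUDA ${have} is new enough"
else
  echo "driver CUDA ${have} is too old"
fi
```

Comparing the components as integers avoids the trap of treating versions as decimals, where 13.10 would wrongly sort below 13.2.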

Files

  • .github/workflows/gpu-tests.yml — new workflow (job gpu-tests).
  • scripts/gpu_test.sh — new on-box runner (cudarc pin + compile + 5 groups, aggregate exit).
  • Makefile — adds test-cuda-fallback, test-prover-cuda, test-prover-comprehensive-cuda (groups 1 & 2 targets already existed).
  • README.md — GPU Tests section + targets table.
  • crypto/math-cuda/build.rs — comment clarifying the PTX/driver version coupling.

Before this gates merges (manual GitHub settings — admin, not code)

The on: merge_group: trigger only makes the workflow eligible to run in the queue; it does not make the check required.

  1. Add gpu-tests to the required status checks for main (Settings → Branches/Rulesets). Otherwise the job runs in the queue but a failure won't block the merge. Current required checks are only Lint and Test.

Sequencing: the gpu-tests check usually needs to have run once (via workflow_dispatch, or after this merges to main) before it appears in the required-checks picker. merge_group runs the workflow from the merge-commit ref (base + PR), same pattern as the existing merge-queue-only test-prover-comprehensive job.

Repo secrets VAST_API_KEY and VAST_TEMPLATE_HASH are already set.

Status

All 5 groups pass end-to-end on a rented RTX 5090. One real test-harness bug was fixed along the way: test-cuda-fallback ran its two tests in parallel, racing on the process-global GPU dispatch counters (assert_eq! count mismatches) — fixed by adding --test-threads=1 to that Makefile target.

Notes

  • Tradeoff: as a merge gate, a persistent Vast outage (no ready box after MAX_TRIES offers across all price bands) would block merges until it clears; the price-band climb, per-host readiness budget + retry-across-offers, and re-scan-for-scarcity loops mitigate it.

@JuArce JuArce self-assigned this Jun 30, 2026
@JuArce JuArce marked this pull request as ready for review June 30, 2026 19:17
@JuArce JuArce added the ai-review Trigger the AI review label Jun 30, 2026
@github-actions


Codex Code Review

No findings.

I reviewed the PR diff only: new GPU merge-queue workflow, GPU test script, Make targets, README additions, and the math-cuda build comment. I did not identify concrete safety/security, VM semantics, GPU resource, correctness, or significant performance issues in the changed code.

Verification was static only per instructions; I did not build or run tests.

Comment thread scripts/gpu_test.sh Outdated
Comment thread scripts/gpu_test.sh Outdated
@claude

claude Bot commented Jun 30, 2026

Contributor

Review: GPU test suite in merge queue

This is an infra/CI PR (workflow + Makefile targets + scripts + docs). I verified the Makefile targets reference real features and test files (test-cuda-faults, cuda_fallback_tests, cuda_path_integration all exist), and the orchestration is well-built. Notable strengths: secrets kept out of the pip-install step, pinned Vast CLI commit, ephemeral per-instance SSH key, ref validation before remote bash -lc, pipefail so test failures aren't masked by tee, and a robust always-destroy teardown with a label-based leak sweep.

No safety, correctness, or performance issues found. Two minor readability nits (posted inline):

  • Low — stale comment: scripts/gpu_test.sh:12 says "All three groups run" but five groups now run.
  • Low — stale comment: scripts/gpu_test.sh:42 references a cuda_max_good>=12.8 offer floor; the new workflow's floor is 13.1. The 12.8 cudarc pin itself is correct — it's the host-side symbol floor and is independent of the PTX/driver-13.1 requirement — but the comment conflates the two.

Both are cosmetic; the workflow is sound. Note the PR's own caveat stands: as a required merge gate, a sustained Vast.ai outage (no offer after the retry loop) would block merges until it clears — an accepted tradeoff per the description.

@MauroToscano

Contributor

/bench-growth

@github-actions


Benchmark — ethrex 20 transfers (median of 3)

Table parallelism: auto (cores / 3)

| Metric | main | PR | Δ |
|--------|------|----|---|
| Peak heap | 73337 MB | 72870 MB | -467 MB (-0.6%) ⚪ |
| Prove time | 40.250s | 40.084s | -0.166s (-0.4%) ⚪ |

✅ No significant change.

🔬 Looks like a small speedup (-0.4%) — below what 3 runs can confirm. Comment /bench-abba to run the drift-free ABBA tiebreaker (paired-t CI + exact Wilcoxon). Note: it occupies the bench server for ~30–40 min.
Optional pair count: /bench-abba 32 (20 resolves ~1%, 32 for ~0.6%).

✅ Low variance (time: 2.8%, heap: 1.9%)

Memory Growth

ethrex distinct-account transfers · default parallelism · 1 sample per point

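| Transfers | main (MB) | PR (MB) | Δ |
|-----------|-----------|---------|---|
| 4 | 24560 | 24844 | +284 MB (+1.2%) |
| 8 | 38026 | 37895 | -131 MB (-0.3%) |
| 12 | 50757 | 48947 | -1810 MB (-3.6%) |
| 16 | 60593 | 61176 | +583 MB (+1.0%) |
| 20 | 72309 | 73056 | +747 MB (+1.0%) |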

Growth rate: 2993 MB / transfer (main: 2952, Δ: +1.4%)
Fit: R² = 0.9995 (main: 0.9969)
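
The reported growth rates are consistent with an ordinary least-squares slope over the five points in the table. The sketch below assumes that is how the figure is derived (the bot's actual method is not shown in this thread); `slope` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: least-squares slope of peak memory (MB) against transfer count.
# Reads "x y" pairs on stdin and prints the slope rounded to a whole number.

slope() {
  awk '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; sxy += $1 * $2 }
    END { printf "%.0f\n", (n * sxy - sx * sy) / (n * sxx - sx * sx) }
  '
}

# Data copied from the Memory Growth table above.
pr_points='4 24844
8 37895
12 48947
16 61176
20 73056'

main_points='4 24560
8 38026
12 50757
16 60593
20 72309'

slope <<<"${pr_points}"     # prints 2993
slope <<<"${main_points}"   # prints 2952
```

The unrounded slopes are 2992.625 and 2951.625 MB per transfer, a ratio of about 1.014, which matches the reported Δ of +1.4%.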

✅ No significant change in memory scaling.

Commit: 1d08005 · Baseline: cached · Runner: self-hosted bench

@JuArce JuArce marked this pull request as draft June 30, 2026 19:48
@github-actions


AI Review

PR #747 · 5 changed files

Findings

| Status | Sev | Location | Finding | Found by |
|--------|-----|----------|---------|----------|
| candidate | low | .github/workflows/gpu-tests.yml:236 | Hardcoded CUDARC_PIN/SYSROOT_DIR in the SSH remote string duplicate the script defaults | minimax (minimax/MiniMax-M3) |
| candidate | low | scripts/gpu_test.sh:42 | Stale comment in scripts/gpu_test.sh still references the old cuda_max_good>=12.8 offer floor | minimax (minimax/MiniMax-M3) |

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-001: Hardcoded CUDARC_PIN/SYSROOT_DIR in the SSH remote string duplicate the script defaults
  • Status: candidate
  • Severity: low
  • Location: .github/workflows/gpu-tests.yml:236
  • Found by: minimax:minimax/MiniMax-M3
  • Verified by: -
  • Rejected by: -

Claim

The REMOTE string passed over SSH hardcodes CUDARC_PIN=cuda-12080 SYSROOT_DIR=/opt/lambda-vm-sysroot, while the same defaults are already declared inside scripts/gpu_test.sh (lines 22-23). If the script defaults change, the workflow will silently disagree with the script.

Evidence

.github/workflows/gpu-tests.yml line 236: CUDARC_PIN=cuda-12080 SYSROOT_DIR=/opt/lambda-vm-sysroot bash scripts/gpu_test.sh. scripts/gpu_test.sh lines 22-23: CUDARC_PIN="${CUDARC_PIN:-cuda-12080}" and export SYSROOT_DIR="${SYSROOT_DIR:-/opt/lambda-vm-sysroot}". Both default to the same values today, but the duplication means a future bump in either default must be made in two places.

Suggested fix

Drop the inline env from the SSH command so the script's defaults are the single source of truth, e.g. ... bash scripts/gpu_test.sh. Override only when intentionally diverging from the defaults.
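
The behaviour this fix relies on is plain `${VAR:-default}` expansion: when the caller sets nothing, the script's defaults apply, and an explicit override still wins. A minimal sketch, where `resolve_defaults` is a hypothetical wrapper and `example-override` is a made-up value, not a real cudarc feature name:

```bash
#!/usr/bin/env bash
# Sketch: the two defaults quoted in the finding, resolved with ${VAR:-default}.

resolve_defaults() {
  local pin="${CUDARC_PIN:-cuda-12080}"
  local sysroot="${SYSROOT_DIR:-/opt/lambda-vm-sysroot}"
  echo "CUDARC_PIN=${pin} SYSROOT_DIR=${sysroot}"
}

unset CUDARC_PIN SYSROOT_DIR

# No inline env: the defaults are the single source of truth.
resolve_defaults
# prints: CUDARC_PIN=cuda-12080 SYSROOT_DIR=/opt/lambda-vm-sysroot

# Intentional divergence: override only the variable that should differ.
CUDARC_PIN=example-override resolve_defaults
# prints: CUDARC_PIN=example-override SYSROOT_DIR=/opt/lambda-vm-sysroot
```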

AI-002: Stale comment in scripts/gpu_test.sh still references the old `cuda_max_good>=12.8` offer floor
  • Status: candidate
  • Severity: low
  • Location: scripts/gpu_test.sh:42
  • Found by: minimax:minimax/MiniMax-M3
  • Verified by: -
  • Rejected by: -

Claim

The explanatory comment about cudarc pinning says the pin "matches the cuda_max_good>=12.8 offer floor", but the new .github/workflows/gpu-tests.yml query now requires cuda_max_good>=13.1 (raised because the base image's nvcc is 13.1 and the build emits 13.1 PTX). This is stale documentation drift introduced by this PR.

Evidence

The comment on line 42 of scripts/gpu_test.sh reads: "# CUDA version (12.8, matching the cuda_max_good>=12.8 offer floor) avoids that." The new workflow in .github/workflows/gpu-tests.yml line 85 uses cuda_max_good>=13.1 with a comment explaining "the box's driver must support CUDA 13.1 because the template's nvcc is 13.1". The script's claim is no longer accurate and will mislead anyone reading the script after the offer floor moves again.

Suggested fix

Update the comment to either: (a) drop the offer-floor reference and just explain that the pin is a known-symbol set older than the newest cudarc; or (b) reference cuda_max_good>=13.1 consistently with the workflow query.

Reviewer Lanes

| Lane | Model | Prompt | Status | Findings |
|------|-------|--------|--------|----------|
| glm | openrouter/z-ai/glm-5.2 | general | error: opencode failed (provider/auth/runtime error) and no findings were submitted | 0 |
| kimi | openrouter/moonshotai/kimi-k2.7-code | general | error: opencode failed (provider/auth/runtime error) and no findings were submitted | 0 |
| minimax | minimax/MiniMax-M3 | general | success | 2 |
| moonmath | zro/minimax-m3 | general | error: agentic lane timed out after 1800s | 0 |
| nemotron | openrouter/nvidia/nemotron-3-ultra-550b-a55b | general | error: opencode failed (provider/auth/runtime error) and no findings were submitted | 0 |

Verification Lanes

| Lane | Model | Status | Confirmed | Rejected | Uncertain |
|------|-------|--------|-----------|----------|-----------|
| deepseek-verifier | openrouter/deepseek/deepseek-v4-pro | error: opencode failed (provider/auth/runtime error) and no verifications were submitted | 0 | 0 | 0 |

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

@JuArce JuArce marked this pull request as ready for review June 30, 2026 21:28
…, sed guard, doc nits (#777)

- gpu-tests.yml: capture wait_ready's exit code in an else branch (rc=$? after
  'if ...; fi' is always 0, so the timeout-vs-image-failure diagnostic never worked)
- gpu-tests.yml: add ServerAliveInterval/CountMax to the test ssh session so a box
  that goes dark mid-suite fails in ~10 min instead of eating the 240-min job budget
- gpu-tests.yml: upload the full test log as an artifact (the step log gets
  truncated in the UI for multi-hour runs, as the workflow itself notes)
- gpu_test.sh: guard the cudarc-pin sed anchors (a silent no-op after a math-cuda
  Cargo.toml refactor would resurrect the fallback-latest driver-symbol panic) and
  restore the mutated Cargo.toml on exit so manual runs don't leave the tree dirty
- Makefile: add test-prover-debug to .PHONY; correct the test-cuda-integration
  comment (the test asserts R1-R4 counters, not R1-R3); scope the comprehensive-cuda
  parity comment (CPU's merge-queue job also runs test_recursion_execute)
- docs/roadmap.md: GPU FFT / Merkle tree / FRI are implemented and CI-tested, not
  'Planned'
- prove_elfs_tests.rs: fix 'cargo test --ignored' -> 'cargo test -- --ignored' in
  the ignore comment
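
The first item in the commit message is a general bash pitfall: an `if` statement whose condition fails and which has no `else` branch exits with status 0, so reading `$?` after `fi` loses the condition's exit code. A minimal sketch, where this `wait_ready` is a hypothetical stand-in that always fails with code 3 (the real function and its exit codes are not shown in this thread):

```bash
#!/usr/bin/env bash
# Sketch: why `rc=$?` after `if ...; fi` hides the condition's exit code.

wait_ready() { return 3; }   # hypothetical stand-in: always fails with code 3

broken() {
  local rc
  if wait_ready; then
    echo "ready"
  fi
  rc=$?   # status of the whole `if` statement: 0, because no branch ran
  echo "${rc}"
}

fixed() {
  local rc=0
  if wait_ready; then
    echo "ready"
  else
    rc=$?   # still the exit status of the failed condition command
  fi
  echo "${rc}"
}

broken   # prints 0
fixed    # prints 3
```

Capturing the code inside the `else` branch is what lets a caller tell different failure modes apart afterwards.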
@MauroToscano MauroToscano merged commit 509fd3f into main Jul 3, 2026
1 check passed
@MauroToscano MauroToscano deleted the ci_run_tests_gpu branch July 3, 2026 21:02
